{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c042ddbb-c2c9-46ed-b36c-c965c0d7ff5b",
   "metadata": {},
   "source": [
    "# Building Private Q&A Assistant Using Mongo and Open Source Model"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7e6fbce6-fec9-47af-8701-99721eedec50",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "This notebook is designed to demonstrate how to implement a document Question-and-Answer (Q&A) task using SuperDuperDB in conjunction with open-source model and MongoDB. It provides a step-by-step guide and explanation of each component involved in the process.\n",
    "\n",
    "Implementing a document Question-and-Answer (Q&A) system using SuperDuperDB, open-source model, and MongoDB can find applications in various real-life scenarios:\n",
    "\n",
    "1. **Customer Support Chatbots:** Enable a chatbot to answer customer queries by extracting information from documents, manuals, or knowledge bases stored in MongoDB or any other SuperDuperDB supported database using Q&A.\n",
    "\n",
    "2. **Legal Document Analysis:** Facilitate legal professionals in quickly extracting relevant information from legal documents, statutes, and case laws, improving efficiency in legal research.\n",
    "\n",
    "3. **Medical Data Retrieval:** Assist healthcare professionals in obtaining specific information from medical documents, research papers, and patient records for quick reference during diagnosis and treatment.\n",
    "\n",
    "4. **Educational Content Assistance:** Enhance educational platforms by enabling students to ask questions related to course materials stored in a MongoDB database, providing instant and accurate responses.\n",
    "\n",
    "5. **Technical Documentation Search:** Support software developers and IT professionals in quickly finding solutions to technical problems by querying documentation and code snippets stored in MongoDB or any other database supported by SuperDuperDB. We did that!\n",
    "\n",
    "6. **HR Document Queries:** Simplify HR processes by allowing employees to ask questions about company policies, benefits, and procedures, with answers extracted from HR documents stored in MongoDB or any other database supported by SuperDuperDB.\n",
    "\n",
    "7. **Research Paper Summarization:** Enable researchers to pose questions about specific topics, automatically extracting relevant information from a MongoDB repository of research papers to generate concise summaries.\n",
    "\n",
    "8. **News Article Information Retrieval:** Empower users to inquire about specific details or background information from a database of news articles stored in MongoDB or any other database supported by SuperDuperDB, enhancing their understanding of current events.\n",
    "\n",
    "9. **Product Information Queries:** Improve e-commerce platforms by allowing users to ask questions about product specifications, reviews, and usage instructions stored in a MongoDB database.\n",
    "\n",
    "By implementing a document Q&A system with SuperDuperDB, open-source model, and MongoDB, these use cases demonstrate the versatility and practicality of such a solution across different industries and domains.\n",
    "\n",
    "All is possible without zero friction with SuperDuperDB. Now back into the notebook."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f98f1c7ae8e02278",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Prerequisites\n",
    "\n",
    "Before starting the implementation, make sure you have the required libraries installed by running the following commands:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "6858da67-597d-4d98-ae4a-41003bb569f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install superduperdb\n",
    "!pip install vllm\n",
    "!pip install sentence_transformers numpy==1.24.4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "85c1a0f7572c43ba",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Connect to datastore \n",
    "\n",
    "First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can configure the `MongoDB_URI` based on your specific setup. \n",
    "Here are some examples of MongoDB URIs:\n",
    "\n",
    "* For testing (default connection): `mongomock://test`\n",
    "* Local MongoDB instance: `mongodb://localhost:27017`\n",
    "* MongoDB with authentication: `mongodb://superduper:superduper@mongodb:27017/documents`\n",
    "* MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f42c42cc-af6a-4712-a993-d9c921693819",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/ubuntu/project/superduperdb/env/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
      "  from .autonotebook import tqdm as notebook_tqdm\n",
      "2023-12-28 12:07:10,806\tINFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001B[32m 2023-Dec-28 12:07:10.81\u001B[0m| \u001B[1mINFO    \u001B[0m | \u001B[36mip-172-31-29-75\u001B[0m| \u001B[36m9f988282-31be-4326-977b-8b4ac96b98a6\u001B[0m| \u001B[36msuperduperdb.base.build\u001B[0m:\u001B[36m144 \u001B[0m | \u001B[1mData Client is ready. mongomock.MongoClient('localhost', 27017)\u001B[0m\n",
      "\u001B[32m 2023-Dec-28 12:07:10.81\u001B[0m| \u001B[1mINFO    \u001B[0m | \u001B[36mip-172-31-29-75\u001B[0m| \u001B[36m9f988282-31be-4326-977b-8b4ac96b98a6\u001B[0m| \u001B[36msuperduperdb.base.build\u001B[0m:\u001B[36m162 \u001B[0m | \u001B[1mConnecting to Metadata Client with engine:  mongomock.MongoClient('localhost', 27017)\u001B[0m\n",
      "\u001B[32m 2023-Dec-28 12:07:10.81\u001B[0m| \u001B[1mINFO    \u001B[0m | \u001B[36mip-172-31-29-75\u001B[0m| \u001B[36m9f988282-31be-4326-977b-8b4ac96b98a6\u001B[0m| \u001B[36msuperduperdb.base.datalayer\u001B[0m:\u001B[36m80  \u001B[0m | \u001B[1mBuilding Data Layer\u001B[0m\n"
     ]
    }
   ],
   "source": [
    "from superduperdb import superduper\n",
    "from superduperdb.backends.mongodb import Collection\n",
    "import os\n",
    "\n",
    "mongodb_uri = os.getenv(\"MONGODB_URI\", \"mongomock://test\")\n",
    "\n",
    "# SuperDuperDB, now handles your MongoDB database\n",
    "# It just super dupers your database\n",
    "db = superduper(mongodb_uri, artifact_store='filesystem://./data/')\n",
    "\n",
    "collection = Collection('questiondocs')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "737497f7d5032bf",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Load Dataset\n",
    "\n",
    "In this example, we use the internal textual data from the `superduperdb` project's API documentation. The objective is to create a chatbot that can offer information about the project. You can either load the data from your local project or use the provided data.\n",
    "\n",
    "If you have the SuperDuperDB project locally and want to load the latest version of the API, uncomment the following cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "d72a2a52-964f-456e-88b6-040965f5ed1e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import glob\n",
    "import re\n",
    "\n",
    "ROOT = '../docs/hr/content/docs/'\n",
    "\n",
    "STRIDE = 3       # stride in numbers of lines\n",
    "WINDOW = 25       # length of window in numbers of lines\n",
    "\n",
    "files = sorted(glob.glob(f'{ROOT}/**/*.md', recursive=True))\n",
    "\n",
    "def get_chunk_link(chunk, file_name):\n",
    "    # Get the original link of the chunk\n",
    "    file_link = file_name[:-3].replace(ROOT, 'https://docs.superduperdb.com/docs/docs/')\n",
    "    # If the chunk has subtitles, the link to the first subtitle will be used first.\n",
    "    first_title = (re.findall(r'(^|\\n)## (.*?)\\n', chunk) or [(None, None)])[0][1]\n",
    "    if first_title:\n",
    "        # Convert subtitles and splice URLs\n",
    "        first_title = first_title.lower()\n",
    "        first_title = re.sub(r'[^a-zA-Z0-9]', '-', first_title)\n",
    "        file_link = file_link + '#' + first_title\n",
    "    return file_link\n",
    "\n",
    "def create_chunk_and_links(file, file_prefix=ROOT):\n",
    "    with open(file, 'r') as f:\n",
    "        lines = f.readlines()\n",
    "    if len(lines) > WINDOW:\n",
    "        chunks = ['\\n'.join(lines[i: i + WINDOW]) for i in range(0, len(lines), STRIDE)]\n",
    "    else:\n",
    "        chunks = ['\\n'.join(lines)]\n",
    "    return [{'txt': chunk, 'link': get_chunk_link(chunk, file)}  for chunk in chunks]\n",
    "\n",
    "\n",
    "all_chunks_and_links = sum([create_chunk_and_links(file) for file in files], [])"
   ]
  },
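  {
   "cell_type": "markdown",
   "id": "sketch-sliding-window-chunks",
   "metadata": {},
   "source": [
    "The sliding-window chunking in the cell above can be tried on a toy input. This is a minimal sketch: `sliding_chunks` is a hypothetical helper (not part of `superduperdb`), and the small `window` and `stride` values are assumptions chosen only to keep the output readable.\n",
    "\n",
    "```python\n",
    "def sliding_chunks(lines, window, stride):\n",
    "    # Return overlapping chunks of up to `window` lines, one starting every `stride` lines\n",
    "    if len(lines) <= window:\n",
    "        # Short inputs fit in a single chunk\n",
    "        return [''.join(lines)]\n",
    "    return [''.join(lines[i:i + window]) for i in range(0, len(lines), stride)]\n",
    "\n",
    "\n",
    "# Each toy line already ends with a newline, as lines from readlines() do\n",
    "toy_lines = [f'line {i}\\n' for i in range(6)]\n",
    "toy_chunks = sliding_chunks(toy_lines, window=3, stride=2)\n",
    "\n",
    "print(len(toy_chunks))\n",
    "print(toy_chunks[0])\n",
    "```\n",
    "\n",
    "Because the stride is smaller than the window, consecutive chunks overlap, so a passage cut at one chunk boundary still appears whole in a neighbouring chunk. Chunks that start near the end of the input are shorter than the window."
   ]
  },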
  {
   "cell_type": "markdown",
   "id": "c9803aef243ad58c",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "Otherwise, you can load the data from an external source. The text chunks include code snippets and explanations, which will be utilized to construct the document Q&A chatbot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "e587e284-0876-4464-a977-ac97a9070787",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current\n",
      "                                 Dload  Upload   Total   Spent    Left  Speed\n",
      "100  763k  100  763k    0     0  4743k      0 --:--:-- --:--:-- --:--:-- 4713k\n"
     ]
    }
   ],
   "source": [
    "# Use !curl to download the 'superduperdb_docs.json' file\n",
    "!curl -O https://jupyter-sessions.s3.us-east-2.amazonaws.com/superduperdb_docs.json\n",
    "\n",
    "import json\n",
    "from IPython.display import Markdown\n",
    "\n",
    "# Open the downloaded JSON file and load its contents into the 'chunks' variable\n",
    "with open('superduperdb_docs.json') as f:\n",
    "    all_chunks_and_links = json.load(f)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46481dc1-704e-443d-8c1a-53e32978776c",
   "metadata": {},
   "source": [
    "View the chunk content:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "4a88ea46-ff2d-4d7f-8cce-707d73a0b53f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://docs.superduperdb.com/docs/docs/data_integrations/sql#setup\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "---\n",
       "\n",
       "sidebar_position: 3\n",
       "\n",
       "---\n",
       "\n",
       "\n",
       "\n",
       "# SQL\n",
       "\n",
       "\n",
       "\n",
       "`superduperdb` supports SQL databases via the [`ibis` project](https://ibis-project.org/).\n",
       "\n",
       "With `superduperdb`, queries may be built which conform to the `ibis` API, with additional \n",
       "\n",
       "support for complex data-types and vector-searches.\n",
       "\n",
       "\n",
       "\n",
       "## Setup\n",
       "\n",
       "\n",
       "\n",
       "The first step in working with an SQL table, is to define a table and schema\n",
       "\n",
       "\n",
       "\n",
       "```python\n",
       "\n",
       "from superduperdb.backends.ibis import dtype, Table\n",
       "\n",
       "from superduperdb import Encoder, Schema\n",
       "\n",
       "\n",
       "\n",
       "my_enc = Encoder('my-enc')\n",
       "\n",
       "\n",
       "\n",
       "schema = Schema('my-schema', fields={'img': my_enc, 'text': dtype('str'), 'rating': dtype('int')})\n",
       "\n",
       "\n",
       "\n",
       "db = superduper()\n",
       "\n",
       "\n",
       "\n",
       "t = Table('my-table', schema=schema)\n"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import *\n",
    "\n",
    "# Assuming 'chunks' is a list or iterable containing markdown content\n",
    "chunk_and_link = all_chunks_and_links[48]\n",
    "print(chunk_and_link['link'])\n",
    "Markdown(chunk_and_link['txt'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f8c4636-88c6-42a4-b471-41be7c20680f",
   "metadata": {},
   "source": [
    "The chunks of text contain both code snippets and explanations, making them valuable for constructing a document Q&A chatbot. The combination of code and explanations enables the chatbot to provide comprehensive and context-aware responses to user queries."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0370732b-0c55-4672-b6be-0830f9a3a755",
   "metadata": {},
   "source": [
    "As usual we insert the data. The `Document` wrapper allows `superduperdb` to handle records with special data types such as images,\n",
    "video, and custom data-types."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7208ef2-c035-43b9-a624-ade42a06ed09",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Document\n",
    "\n",
    "# Insert multiple documents into the collection\n",
    "insert_ids = db.execute(collection.insert_many([Document(chunk_and_link) for chunk_and_link in all_chunks_and_links]))\n",
    "print(insert_ids[:5])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b299b6f-37ae-46d7-b064-7d368d98d68a",
   "metadata": {},
   "source": [
    "## Create a Vector-Search Index\n",
    "\n",
    "To enable question-answering over your documents, set up a standard `superduperdb` vector-search index using `sentence_transformers` (other options include `torch`, `openai`, `transformers`, etc.)."
   ]
  },
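  {
   "cell_type": "markdown",
   "id": "sketch-brute-force-vector-search",
   "metadata": {},
   "source": [
    "To build intuition for what a vector-search does, the idea can be sketched in plain Python. This is an illustration only, not how `superduperdb` or MongoDB implement the search: `cosine` and `top_n` are hypothetical helpers, the two-dimensional vectors are made up, and cosine similarity is assumed as the similarity measure.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def cosine(a, b):\n",
    "    # Cosine similarity: dot product divided by the product of the vector lengths\n",
    "    dot = sum(x * y for x, y in zip(a, b))\n",
    "    norm_a = math.sqrt(sum(x * x for x in a))\n",
    "    norm_b = math.sqrt(sum(y * y for y in b))\n",
    "    return dot / (norm_a * norm_b)\n",
    "\n",
    "\n",
    "def top_n(query_vec, doc_vecs, n=2):\n",
    "    # Score every document against the query and keep the n best indices\n",
    "    scored = [(cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]\n",
    "    scored.sort(reverse=True)\n",
    "    return [idx for _, idx in scored[:n]]\n",
    "\n",
    "\n",
    "toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]\n",
    "toy_query = [1.0, 0.1]\n",
    "nearest = top_n(toy_query, toy_docs, n=2)\n",
    "print(nearest)\n",
    "```\n",
    "\n",
    "In the notebook, the embedding model turns each text chunk and each question into a vector, and the search returns the chunks whose vectors are closest to the question's vector."
   ]
  },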
  {
   "cell_type": "markdown",
   "id": "7930b9a1-1483-4106-873c-d85a3920c64e",
   "metadata": {},
   "source": [
    "A `Model` is a wrapper around a self-built or ecosystem model, such as `torch`, `transformers`, `openai`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "56905f2e-485e-4179-8585-34eac26c0751",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[2023-12-28 12:07:14] sentence_transformers.SentenceTransformer INFO Load pretrained SentenceTransformer: BAAI/bge-large-en-v1.5\n",
      "[2023-12-28 12:07:15] sentence_transformers.SentenceTransformer INFO Use pytorch device: cuda\n"
     ]
    }
   ],
   "source": [
    "import sentence_transformers\n",
    "from superduperdb import Model, vector\n",
    "\n",
    "model = Model(\n",
    "    identifier='embedding', \n",
    "    object=sentence_transformers.SentenceTransformer('BAAI/bge-large-en-v1.5'),\n",
    "    encoder=vector(shape=(1024,)),\n",
    "    predict_method='encode', # Specify the prediction method\n",
    "    postprocess=lambda x: x.tolist(),  # Define postprocessing function\n",
    "    batch_predict=True, # Generate predictions for a set of observations all at once \n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "6bb05a78-263e-4e6f-b429-8e51dbb932b8",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.83it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vector size:  1024\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "vector = model.predict('This is a test', one=True)\n",
    "print('vector size: ', len(vector))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4331b81b-c257-4353-aab4-8f601bef78de",
   "metadata": {},
   "source": [
    "A `Listener` essentially deploys a `Model` to \"listen\" to incoming data, computes outputs, and then saves the results in the database via `db`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "c1625dab-6438-494b-b74d-efb58bfc8610",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the Listener class from the superduperdb module\n",
    "from superduperdb import Listener\n",
    "\n",
    "\n",
    "# Create a Listener instance with the specified model, key, and selection criteria\n",
    "listener = Listener(\n",
    "    model=model,          # The model to be used for listening\n",
    "    key='txt',            # The key field in the documents to be processed by the model\n",
    "    select=collection.find()  # The selection criteria for the documents\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "591dad80-3788-441b-96db-a5bf23a16979",
   "metadata": {},
   "source": [
    "A `VectorIndex` wraps a `Listener`, allowing its outputs to be searchable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "1aa132d0-e6a2-46f6-9eb8-13fbce90ff11",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "1001it [00:00, 19700.99it/s]\n",
      "Batches: 100%|████████████████████████████████████████████████████████| 32/32 [00:11<00:00,  2.76it/s]\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001B[32m 2023-Dec-28 12:07:32.42\u001B[0m| \u001B[1mINFO    \u001B[0m | \u001B[36mip-172-31-29-75\u001B[0m| \u001B[36m9f988282-31be-4326-977b-8b4ac96b98a6\u001B[0m| \u001B[36msuperduperdb.components.model\u001B[0m:\u001B[36m477 \u001B[0m | \u001B[1mAdding 1001 model outputs to `db`\u001B[0m\n"
     ]
    }
   ],
   "source": [
    "# Import the VectorIndex class from the superduperdb module\n",
    "from superduperdb import VectorIndex\n",
    "\n",
    "# Add a VectorIndex to the SuperDuperDB database with the specified identifier and indexing listener\n",
    "_ = db.add(\n",
    "    VectorIndex(\n",
    "        identifier='my-index',        # Unique identifier for the VectorIndex\n",
    "        indexing_listener=listener    # Listener to be used for indexing documents\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "7fde5b17-9d71-4535-aaf6-85f4fa9910e4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'# Anthropic\\n\\n\\n\\n`superduperdb` allows users to work with `anthropic` API models.\\n\\n\\n\\nRead more about this [here](/docs/docs/walkthrough/ai_models#anthropic).'"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Execute a find_one operation on the SuperDuperDB collection\n",
    "document = db.execute(collection.find_one())\n",
    "document.content['txt']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "92948823-0d18-4e1b-b103-f226d6b09e52",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb.backends.mongodb import Collection\n",
    "from superduperdb import Document as D\n",
    "from IPython.display import *\n",
    "\n",
    "# Define the query for the search\n",
    "# query = 'Code snippet how to create a `VectorIndex` with a torchvision model'\n",
    "query = 'can you explain vector-indexes with `superduperdb`?'\n",
    "\n",
    "# Execute a search using SuperDuperDB to find documents containing the specified query\n",
    "result = db.execute(\n",
    "    collection\n",
    "        .like(D({'txt': query}), vector_index='my-index', n=5)\n",
    "        .find()\n",
    ")\n",
    "\n",
    "# Display a horizontal rule to separate results\n",
    "display(Markdown('---'))\n",
    "\n",
    "# Display each document's 'txt' field and separate them with a horizontal rule\n",
    "for r in result:\n",
    "    display(Markdown(r['txt']))\n",
    "    display(r['link'])\n",
    "    display(Markdown('---'))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0922a0dc623d7bf",
   "metadata": {
    "collapsed": false
   },
   "source": [
    "## Create a LLM Component\n",
    "\n",
    "In this step, a LLM component is created and added to the system. This component is essential for the Q&A functionality:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "abfa4df6-73ac-4d46-8047-011648e24958",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['embedding', 'llm']\n"
     ]
    }
   ],
   "source": [
    "from superduperdb.ext.llm.vllm import VllmModel\n",
    "\n",
    "# Define the prompt for the llm model\n",
    "prompt_template = (\n",
    "    'Use the following description and code snippets about SuperDuperDB to answer this question about SuperDuperDB\\n'\n",
    "    'Do not use any other information you might have learned about other python packages\\n'\n",
    "    'Only base your answer on the code snippets retrieved and provide a very concise answer\\n'\n",
    "    '{context}\\n\\n'\n",
    "    'Here\\'s the question:{input}\\n'\n",
    "    'answer:'\n",
    ")\n",
    "\n",
    "# Create an instance of llm with the specified model and prompt\n",
    "llm = VllmModel(identifier='llm',\n",
    "                 model_name='mistralai/Mistral-7B-Instruct-v0.2', \n",
    "                 prompt_template=prompt_template,\n",
    "                 inference_kwargs={\"max_tokens\":512})\n",
    "\n",
    "# Add the llm instance\n",
    "db.add(llm)\n",
    "\n",
    "# Print information about the models in the SuperDuperDB database\n",
    "print(db.show('model'))"
   ]
  },
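  {
   "cell_type": "markdown",
   "id": "sketch-fill-prompt-template",
   "metadata": {},
   "source": [
    "The `{context}` and `{input}` placeholders in the prompt template are filled in at prediction time. The sketch below shows the idea with plain `str.format`. It is an illustration under assumptions: `build_prompt` is a hypothetical helper, the shortened `toy_template` is made up, and joining the retrieved chunks with newlines is a guess at how the context is assembled, not the confirmed behaviour of `VllmModel`.\n",
    "\n",
    "```python\n",
    "toy_template = (\n",
    "    'Use the following snippets to answer the question\\n'\n",
    "    '{context}\\n\\n'\n",
    "    'Here is the question:{input}\\n'\n",
    "    'answer:'\n",
    ")\n",
    "\n",
    "\n",
    "def build_prompt(template, question, context_chunks):\n",
    "    # Assumption: retrieved chunks are joined with newlines before substitution\n",
    "    context = '\\n'.join(context_chunks)\n",
    "    return template.format(context=context, input=question)\n",
    "\n",
    "\n",
    "toy_prompt = build_prompt(toy_template, 'What is a VectorIndex?', ['chunk one', 'chunk two'])\n",
    "print(toy_prompt)\n",
    "```\n",
    "\n",
    "Ending the template with `answer:` nudges the model to continue with the answer itself rather than restating the question."
   ]
  },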
  {
   "cell_type": "markdown",
   "id": "696ac7bb-eaaf-4bec-9561-603b3c98a736",
   "metadata": {},
   "source": [
    "## Ask Questions to Your Docs\n",
    "\n",
    "Finally, you can ask questions about the documents. You can target specific queries and use the power of MongoDB for vector-search and filtering rules. Here's an example of asking a question:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "fc4a0f6c-9e24-47aa-bc73-7cc4507e94ff",
   "metadata": {},
   "outputs": [],
   "source": [
    "from superduperdb import Document\n",
    "from IPython.display import Markdown\n",
    "\n",
    "def question_the_doc(question):\n",
    "    # Use the SuperDuperDB model to generate a response based on the search term and context\n",
    "    output, sources = db.predict(\n",
    "        model_name='llm',\n",
    "        input=question,\n",
    "        context_select=(\n",
    "            collection\n",
    "                .like(Document({'txt': question}), vector_index='my-index', n=5)\n",
    "                .find()\n",
    "        ),\n",
    "        context_key='txt',\n",
    "    )\n",
    "    \n",
    "    # Get the reference links corresponding to the answer context\n",
    "    links = '\\n'.join(sorted(set([source.unpack()['link'] for source in sources])))\n",
    "    \n",
    "    # Display the generated response using Markdown\n",
    "    return Markdown(output.content + f'\\n\\nrefs: \\n\\n{links}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "8e561926-ac6b-43aa-834d-d64bc1aef672",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.04it/s]\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       "\n",
       "\n",
       "\n",
       "\n",
       "SuperDuperDB is a library for vector-search in Python. It allows users to create vector-indexes on collections of data, and then perform vector-searches on these indexes using the `.like` operator.\n",
       "\n",
       "\n",
       "\n",
       "The process of creating a vector-index involves registering one or more machine learning models with the index, and specifying which data in the collection should be used to calculate vectors for indexing. Once the index is created, it can be queried using either the `pymongo` or `ibis` query APIs, along with the `.like` operator to perform vector-searches.\n",
       "\n",
       "\n",
       "\n",
       "The `.like` operator takes a document as its argument, which is vectorized using the registered models. The resulting vectors are then used to perform a similarity search against the vectors in the index. The results of the search are filtered using standard query operators, such as `find_one()` or `find()`, to return the documents that match both the vector-search results and the standard query conditions.\n",
       "\n",
       "\n",
       "\n",
       "The order in which the standard query conditions and the `.like` operator are applied can be permuted, resulting in two different algorithms for performing vector-searches: finding similar items based on a text query, and then filtering by other conditions, or finding items that match certain conditions, and then finding the similar items within that subset.\n",
       "\n",
       "refs: \n",
       "\n",
       "https://docs.superduperdb.com/docs/docs/fundamentals/vector_search_algorithm#philosophy\n",
       "https://docs.superduperdb.com/docs/docs/walkthrough/vector_search#querying-the--vectorindex--with-the-hybrid-query-api"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question_the_doc(\"can you explain vector-indexes with `superduperdb`?'\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "1b1d5be3-a101-4f2a-abb2-fd6d907093fc",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 24.73it/s]\n"
     ]
    },
    {
     "data": {
      "text/markdown": [
       " SuperDuperDB supports various databases such as PostgreSQL, MySQL, and SQLite, and AI frameworks like PyTorch, TensorFlow, and Scikit-learn. It also supports AI APIs like OpenAI, Anthrophic, and Cohere.\n",
       "\n",
       "refs: \n",
       "\n",
       "https://docs.superduperdb.com/docs/docs/intro#what-is-superduperdb-\n",
       "https://github.com/SuperDuperDB/superduperdb/blob/main/README.md"
      ],
      "text/plain": [
       "<IPython.core.display.Markdown object>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "question_the_doc(\"What databases and AI frameworks does SuperDuperDB support?\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b3de340b",
   "metadata": {},
   "source": [
    "## Now you can build an API as well just like we did\n",
    "### FastAPI Question the Docs Apps Tutorial\n",
    "This tutorial will guide you through setting up a basic FastAPI application for handling questions with documentation. The tutorial covers both local development and deployment to the Fly.io platform.\n",
    "https://github.com/SuperDuperDB/chat-with-your-docs-backend"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
